Use this template to complete your project throughout the course. Your Final Project presentation will be based on the contents of this document. Replace the title/name above and text below with your own, but keep the headers. Feel free to change the theme and other display settings, but this is not required.
My project will be a limited meta-analysis of existing bulk RNAseq data sets detailing the differentiation of induced-pluripotent stem cells (IPSCs) into microglia, a key population of resident immune cells in the central nervous system (CNS). IPSC-based modeling is the future of microglia biology in vitro, thus there are many competing standards - eg different IPSC-derived microglia produced by different groups. My analysis will compare the transcriptomes of these competing microglia to a “Gold Standard” or reference transcriptomic dataset generated using microglia isolated from post-mortem biopsies from human donors.
Project Repository: https://github.com/gesualdiJ/BMIN503_Final_Project
Microglia have long been challenging to model in-vitro using conventional approaches such as immortalized cell lines or primary culture because their morphology, function, and (more recently appreciated) transcriptome rapidly deteriorate when cultured in isolation. To address this issue, numerous groups have developed microglial cell lines derived from IPSCs through supplementation of cultures with various factors normally present in the micro-encironment of the CNS. The goal of this project is to interrogte the merit of several published IPSC-derived microglia lines along with some more recently developed applications of these cells involving chimeric animal model systems.
Addressing this problem involves suject matter knowledge of both immunology and neuroscience, as well as various bioinformatic techniques. This analysis will include only a subset of published microglial lines, prioritizing those models that myself and Drs Schaletzki and Su considered influential. The reference transcriptomes - one from adult donors and one generated from analysis of pediatric tissue - add rigor to this project because they represent the closest that we can realistically hope to get to the in vivo gene expression of microglia. The microglia that were analyzed to generate these reference transcriptomes were isolated immediately after surgical resection from patients, therefore there has been minimal time for these cells to lose their in vivo character. Finally, this analysis will be useful to microglia biologists because while many transcriptomic data sets exist, there have been few if any comprehensive meta-analyses of these data and still fewer attempts to harmonize their insights. This project aims to make progress in that direction.
Describe the data used and general methodological approach. Subsequently, incorporate full R code necessary to retrieve and clean data, and perform analysis. Be sure to include a description of code so that others (including your future self) can understand what you are doing and why.
#install and load packages—-
install.packages("BiocManager")
BiocManager::install(c("GEOquery",
"oligo",
"limma",
"hgu133plus2.db",
"pd.hg.u133.plus.2",
"viridis",
"fgsea",
"edgeR",
"gplots"))
install.packages("tidyverse")
install.packages("reshape2")
install.packages("edgeR")
install.packages("readxl")
install.packages("ggplot2")
install.packages("fgsea")
install.packages("stringr")
install.packages("readr")
install.packages("purrr")
install.packages("plotly")
install.packages("gt")
install.packages("magrittr")
BiocManager::install("ensembldb")
BiocManager::install("AnnotationHub")
BiocManager::install("rhdf5")
BiocManager::install("mixOmics")
BiocManager::install('EnhancedVolcano')library(BiocManager)
library(tidyverse)
library(reshape2)
library(rhdf5)
library(edgeR)
library(readxl)
library(GEOquery)
library(oligo)
library(limma)
library(viridis)
library(pd.hg.u133.plus.2)
library(hgu133plus2.db)
library(ggplot2)
library(gplots)
library(fgsea)
library(stringr)
library(readr)
library(ensembldb)
library(AnnotationHub)
library(purrr)
library(plotly)
library(mixOmics)
library(gt)
library(EnhancedVolcano)
library(magrittr)#Gosselin Data, Fetal Ex-Vivo Microglia (Reference) ----
#This Reference dataset contains 24 ex-vivo microglia samples from
#pediatric patients isolated following surgical resection
#The number of observations (expressed genes) for this dataset is 14,528
#The processed data frames containing these data are stored in tbl's
#"ExVivo_Juvenile_MG_Reference_Gosselin" (no GeneID column, gene names are row names)
#or "Gosselin_log2_cpm_filtered_norm" (maintains GeneID as a separate column)
#These data are published but the .xlsx file containing the raw counts,
#sample labels, and gene names were sent to me by a collaborator
#Therefore, I can just read the counts into a data frame and begin cleaning,
#filtering, and normalizing without having to access any
#Transcriptomic repositories or databases. That makes the processing for this data
#set simpler but less generalizeable. The following
#Datasets make use of either EnsembleDB or ArchS4 and are
#therefore more informative for directly accessing published data without any personal connections.
#Read in raw counts
microglia_fetal_ExVivo_Gosselin <- read_xlsx("Human_RNA-seq_data_Gosselin_Fetal_ExVivo.xlsx", sheet = 1, skip = 1)
microglia_fetal_ExVivo_Gosselin <- dplyr::rename(microglia_fetal_ExVivo_Gosselin, Gene_ID = ...1)
#remove any duplicated genes symbols (2) and ove gene labels from column to rowname
microglia_fetal_ExVivo_Gosselin <- microglia_fetal_ExVivo_Gosselin[!duplicated(microglia_fetal_ExVivo_Gosselin$Gene_ID), ]
microglia_fetal_ExVivo_Gosselin <- column_to_rownames(microglia_fetal_ExVivo_Gosselin, var = "Gene_ID")
#Removes 40 columns of unneeded monocyte/in vitro/lysate data
microglia_fetal_ExVivo_Gosselin <- microglia_fetal_ExVivo_Gosselin %>%
dplyr::select(!contains("InVitro") & !contains("Monocytes") & !contains("Lysate"))
sample_Labels_Gosselin <- colnames(microglia_fetal_ExVivo_Gosselin)
#Convert to DGE list to facilitate normalization and count visualization
#Then get counts per million (cpm) to visualize unexpressed / lowly expressed probes
DGEList_Gosselin <- DGEList(microglia_fetal_ExVivo_Gosselin)
Gosselin_cpm <- cpm(DGEList_Gosselin$counts)
#Table shows number of non-expressed genes across all samples, indicates roughly how many should be filtered out
table(rowSums(DGEList_Gosselin$counts==0)==24) ##
## FALSE TRUE
## 22061 2750
#Create and reshape (melt) tibble of log transformed raw counts to visualized unfiltered / unnormalized data
#reshape2::melt() is a method that transforms tbl.df's from a wide or user friendly format
# to a 'tall' or processing friendly format. see below
Gosselin_log2_cpm <- as_tibble(cpm(DGEList_Gosselin$counts, log = TRUE) , rownames = "GeneID")
Gosselin_log2_cpm_melt <- melt(Gosselin_log2_cpm)
head(Gosselin_log2_cpm_melt)## GeneID variable value
## 1 A1BG RNA_HMG002_S002_Microglia_ExVivo -0.7055321
## 2 A1BG-AS1 RNA_HMG002_S002_Microglia_ExVivo 0.2073467
## 3 A1CF RNA_HMG002_S002_Microglia_ExVivo -2.9884547
## 4 A2M RNA_HMG002_S002_Microglia_ExVivo 13.0833842
## 5 A2M-AS1 RNA_HMG002_S002_Microglia_ExVivo 7.2165096
## 6 A2ML1 RNA_HMG002_S002_Microglia_ExVivo -1.5456431
#visualize overall expression to show excessive influence of unexpressed / lowly expressed genes
ggplot(Gosselin_log2_cpm_melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
theme(axis.text.x=element_blank()) +
labs(y="log2 expression",
title="Log2 Counts per Million (CPM)",
subtitle="Gosselin, unfiltered, non-normalized",
caption=paste0("produced on ", Sys.time()))## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
#create logical vector that evaluates to TRUE when at least 6 of 24 samples express a given gene
filtrate_gosselin <- rowSums(Gosselin_cpm > 1) >= 6
table(filtrate_gosselin)## filtrate_gosselin
## FALSE TRUE
## 10283 14528
#subset DGE list so that only genes expressed in at least 6 samples remain
#the get counts per million / reshape to re-visualize distribution of expression without unexpressed probes
DGEList_Gosselin_filtered <- DGEList_Gosselin[filtrate_gosselin, ]
Gosselin_log2_cpm_filtered <- as_tibble(cpm(DGEList_Gosselin_filtered, log = TRUE), rownames = "GeneID")
Gosselin_log2_cpm_filtered_melt <- melt(Gosselin_log2_cpm_filtered)
ggplot(Gosselin_log2_cpm_filtered_melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
theme(axis.text.x=element_blank()) +
labs(y="log2 expression",
title="Log2 Counts per Million (CPM)",
subtitle="Gosselin, filtered, non-normalized",
caption=paste0("produced on ", Sys.time()))#TPM normalization of filtered DGE list, then store data in tbl.df's
DGEList_Gosselin_filtered_norm <- calcNormFactors(DGEList_Gosselin_filtered, method = "TMM")
Gosselin_log2_cpm_filtered_norm <- as_tibble(cpm(DGEList_Gosselin_filtered_norm, log = TRUE), rownames = "GeneID")
ExVivo_Juvenile_MG_Reference_Gosselin <- column_to_rownames(Gosselin_log2_cpm_filtered_norm, var = "GeneID")
#clean up lengthy sample labels for this dataset. haters will suggest rewriting with vapply. I will never learn vapply!
labels_Gosselin <- rep(NA, 24)
i <- 1
while (i <= 24){
labels_Gosselin[i] = paste0(paste0("J", substring(sample_Labels_Gosselin[i], 16, 32)), as.character(i))
i = i + 1
}
colnames(ExVivo_Juvenile_MG_Reference_Gosselin) <- labels_Gosselin
colnames(Gosselin_log2_cpm_filtered_norm)[2:ncol(Gosselin_log2_cpm_filtered_norm)] <- labels_GosselinThe plots generated above, as well as the subsequent ones for future datasets, show that a high density of probes are either very lowly expressed or not expressed at all across all samples in the dataset. This is typical as bulk sequencing reads an extremely high number of probes. It is standard procedure to filter these unexpressed probes out, leading to the more uniform distribution observed in the second violin plot.
#Eggen Data, Adult Ex-Vivo Microglia (Reference)----
##This processed Reference dataset will contain 39 total samples
#(all Ex Vivo microglia surgically resected from adult patients)
#The number of observations (expressed genes) for this dataset is 14,397
#The processed data frames containing these data are stored in tbl's
#"ExVivo_Adult_MG_Reference_Eggen" (no GeneID column, gene names are row names)
#or "Eggen_log2_cpm_filtered_norm" (maintains GeneID as a separate column)
#This dataset was accessed by directly downloading the tsv containing the
#raw counts from GEO. However, these data contained only
#transcript IDs instead of gene names like the other datasets. Therefore,
#I needed to use AnnotationHub() to access the Ensemble Data base
#in order to map the transcript IDs to the appropriate gene names.
#This is comparable to the use of hgu133plus2.db in Practicum 9,
#But is optimized for Bulk RNA sequencing data rather than MicroArrays,
#which are somewhat out of date (no shade)
#This is an alternative approach to using the ArchS4 database, which I employ for
#the next 3 datasets. This Method was mainly included to showcase alternate approaches
#to access GEO data without downloading and reading in the large (16 GB!) Archs4 database
#This is also a more beginner friendly approach that allows the user to avoid working with
#the somewhat cumbersome HDF5 files that ArchS4 uses to organize data
ah <- AnnotationHub()
ahDb <- query(ah, pattern = c("Homo Sapiens", "EnsDb"))
#Query for available H.Sapiens EnsDb databases, using for mapping probe id's to gene names
EndDB_Hs <- ahDb[["AH104864"]]
EnsDB_Hs <- EndDB_Hs
EnsDB_Hs_DF <- genes(EnsDB_Hs, return.type = "data.frame") %>%
dplyr::select(gene_id, gene_name)
EnsDB_Hs_DF <- dplyr::rename(EnsDB_Hs_DF, Tx_ID = gene_id)
#read in data, will need to add gene names using EnsDB information
microglia_adult_ExVivo_Eggen <- read_tsv("GSE99074_HumanMicrogliaBrainCounts.txt")
microglia_adult_ExVivo_Eggen <- dplyr::rename(microglia_adult_ExVivo_Eggen, Tx_ID = ...1)
#subset list of matched tx_id's and gene names to only include tx's present in dataset
EnsDB_Hs_DF <- dplyr::filter(EnsDB_Hs_DF, EnsDB_Hs_DF$Tx_ID %in% microglia_adult_ExVivo_Eggen$Tx_ID)
microglia_adult_ExVivo_Eggen <- inner_join(microglia_adult_ExVivo_Eggen, EnsDB_Hs_DF, by = "Tx_ID")
microglia_adult_ExVivo_Eggen <- relocate(microglia_adult_ExVivo_Eggen, gene_name, .before = spm09)
microglia_adult_ExVivo_Eggen <- microglia_adult_ExVivo_Eggen[ , 1:67]
microglia_adult_ExVivo_Eggen <- dplyr::rename(microglia_adult_ExVivo_Eggen, GeneID = gene_name)
#5116 transcripts ID's for which there is no corresponding gene name, we remove using dplyr::filter
table(microglia_adult_ExVivo_Eggen$GeneID == "") ##
## FALSE TRUE
## 22721 5116
microglia_adult_ExVivo_Eggen <- dplyr::filter(microglia_adult_ExVivo_Eggen, GeneID != "")
microglia_adult_ExVivo_Eggen <- microglia_adult_ExVivo_Eggen[ , 2:67]
microglia_adult_ExVivo_Eggen <- microglia_adult_ExVivo_Eggen[1:40]
microglia_adult_ExVivo_Eggen <- microglia_adult_ExVivo_Eggen[!duplicated(microglia_adult_ExVivo_Eggen$GeneID), ]
microglia_adult_ExVivo_Eggen <- column_to_rownames(microglia_adult_ExVivo_Eggen, var = "GeneID")
#Create DGElist, find counts per million, visualize distribution of data, and filter out unexpressed probes
DGEList_Eggen <- DGEList(microglia_adult_ExVivo_Eggen)
Eggen_cpm <- cpm(DGEList_Eggen)
table(rowSums(DGEList_Eggen$counts==0)==20) ##
## FALSE TRUE
## 22270 237
#Create and reshape (melt) tibble of log transformed raw counts to visualized unfiltered / unnormalized data
Eggen_log2_cpm <- as_tibble(cpm(DGEList_Eggen, log = TRUE) , rownames = "GeneID")
Eggen_log2_cpm_melt <- melt(Eggen_log2_cpm)
#visualize overall expression to show excessive influence of unexpressed / lowly expressed genes
ggplot(Eggen_log2_cpm_melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
theme(axis.text.x=element_blank()) +
labs(y="log2 expression",
title="Log2 Counts per Million (CPM)",
subtitle="Eggen, unfiltered, non-normalized",
caption=paste0("produced on ", Sys.time()))#create logical vector that evaluates to TRUE when at least 20 of 39 samples express a given gene
filtrate_eggen <- rowSums(Eggen_cpm > 1) >= 20
table(filtrate_eggen)## filtrate_eggen
## FALSE TRUE
## 8110 14397
#subset DGE list so that only genes expressed in at least 24 samples remain, remove 8110 unexpressed genes
#then reformat to log2 counts per million to reshape and revisualize distribution of data
DGEList_Eggen_filtered <- DGEList_Eggen[filtrate_eggen, ]
Eggen_log2_cpm_filtered <- as_tibble(cpm(DGEList_Eggen_filtered, log = TRUE), rownames = "GeneID")
Eggen_log2_cpm_filtered_melt <- melt(Eggen_log2_cpm_filtered)
ggplot(Eggen_log2_cpm_filtered_melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
theme(axis.text.x=element_blank()) +
labs(y="log2 expression",
title="Log2 Counts per Million (CPM)",
subtitle="Eggen, filtered, non-normalized",
caption=paste0("produced on ", Sys.time()))#TPM normalization, store proccessed data as tbl.df
DGEList_Eggen_filtered_norm <- calcNormFactors(DGEList_Eggen_filtered, method = "TMM")
Eggen_log2_cpm_filtered_norm <- as_tibble(cpm(DGEList_Eggen_filtered_norm, log = TRUE), rownames = "GeneID")
ExVivo_Adult_MG_Reference_Eggen <- column_to_rownames(Eggen_log2_cpm_filtered_norm, var = "GeneID")
#Clean up non-intuitive sample labels.
labels_Eggen <- rep(NA, ncol(ExVivo_Adult_MG_Reference_Eggen))
j <- 1
while (j <= length(labels_Eggen)){
labels_Eggen[j] = paste0("A_Microllia_ExVivo", as.character(j))
j = j + 1
}
colnames(ExVivo_Adult_MG_Reference_Eggen) <- labels_Eggen
#Format maintaining GeneID column, needed for join function
Eggen_log2_cpm_filtered_norm <- rownames_to_column(ExVivo_Adult_MG_Reference_Eggen, var = "GeneID")#KJS Lab internal data, IPSC-Microglia and Monocyte Derived Macrophages, Pre-Processed ----
#This data set was previously prepared for my thesis lab.
#Contains 3 IPSC-Microglia samples and 3 Monocyte-derived Macrophage (MDM) samples
#The total number of observations (expressed genes) for this dataset is 14,139
#Only handling here is adding the prefix "KJS_" to the start of the existing
#sample labels to facilitate downstream aggregation / analysis
#Tbl "KJS_iMG_MDM" contains this dataset and maintains a GeneID column
KJS_iMG_MDM <- read_xlsx("KJS_iMG_MDM_TMM_Normalized.xlsx")
labels_KJS <- c(rep(NA, ncol(KJS_iMG_MDM)))
labels_KJS[1] <- "GeneID"
for(n in 2: ncol(KJS_iMG_MDM)){
labels_KJS[n] <- paste0("KJS_", colnames(KJS_iMG_MDM[n]))
}
colnames(KJS_iMG_MDM) <- labels_KJS
#Abud 2017 data (IPSC-Microglia +/- coculture with Rat Hippocampal neurons)----
#This processed dataset will contain 15 total samples (9 monocultured IPSC-iMg,
#6 iMg cocultured with rat hippocampal neurons. Cocultured iMg are notated 'rHcN')
#The number of observations (expressed genes) for this dataset is 12,968
#The processed data frames containing these data are stored in tbl's "IPSC_Microglia_Abud"
#(no GeneID column, gene names are row names)
#or "DGEList_Abud_filtered_norm" (maintains GeneID as a separate column)
#This dataset, as well as those from the Svoboda and Blurton-Jones Publications
#were accessed and cleaned using the Archs4 database which contains
#All of the samples ever uploaded to the Gene Expression Omnibus Repository
#and includes - among other information - raw counts for samples as
#Well as descriptions of different samples and overall study design.
#This is the most efficient way to access and clean data from GEO,
#as the available metadata typically provides a work around for
#digging into the supplementary information of a given publication
#to find sample information
#bring in downloaded ArchS4 database containing all GEO accessible transcriptomic data
archs4.human <- "/Users/jamesgesualdi/Desktop/BMIN_5030/BMIN503_Final_Project_Local/human_matrix_v11.h5"
h5ls(archs4.human) #List the contents of the loaded Database## group name otype dclass dim
## 0 / data H5I_GROUP
## 1 /data expression H5I_DATASET INTEGER 441356 x 35238
## 2 / meta H5I_GROUP
## 3 /meta genes H5I_GROUP
## 4 /meta/genes chromosome H5I_DATASET STRING 35238
## 5 /meta/genes ensembl_gene_id H5I_DATASET STRING 35238
## 6 /meta/genes gene_biotype H5I_DATASET STRING 35238
## 7 /meta/genes gene_symbol H5I_DATASET STRING 35238
## 8 /meta/genes genes H5I_DATASET STRING 35238
## 9 /meta/genes start_position H5I_DATASET STRING 35238
## 10 /meta info H5I_GROUP
## 11 /meta/info author H5I_DATASET STRING ( 0 )
## 12 /meta/info contact H5I_DATASET STRING ( 0 )
## 13 /meta/info creation-date H5I_DATASET STRING ( 0 )
## 14 /meta/info laboratory H5I_DATASET STRING ( 0 )
## 15 /meta/info version H5I_DATASET INTEGER ( 0 )
## 16 /meta samples H5I_GROUP
## 17 /meta/samples channel_count H5I_DATASET STRING 441356
## 18 /meta/samples characteristics_ch1 H5I_DATASET STRING 441356
## 19 /meta/samples contact_address H5I_DATASET STRING 441356
## 20 /meta/samples contact_city H5I_DATASET STRING 441356
## 21 /meta/samples contact_country H5I_DATASET STRING 441356
## 22 /meta/samples contact_institute H5I_DATASET STRING 441356
## 23 /meta/samples contact_name H5I_DATASET STRING 441356
## 24 /meta/samples contact_zip H5I_DATASET STRING 441356
## 25 /meta/samples data_processing H5I_DATASET STRING 441356
## 26 /meta/samples extract_protocol_ch1 H5I_DATASET STRING 441356
## 27 /meta/samples geo_accession H5I_DATASET STRING 441356
## 28 /meta/samples instrument_model H5I_DATASET STRING 441356
## 29 /meta/samples last_update_date H5I_DATASET STRING 441356
## 30 /meta/samples library_selection H5I_DATASET STRING 441356
## 31 /meta/samples library_source H5I_DATASET STRING 441356
## 32 /meta/samples library_strategy H5I_DATASET STRING 441356
## 33 /meta/samples molecule_ch1 H5I_DATASET STRING 441356
## 34 /meta/samples organism_ch1 H5I_DATASET STRING 441356
## 35 /meta/samples platform_id H5I_DATASET STRING 441356
## 36 /meta/samples readsaligned H5I_DATASET INTEGER 441356
## 37 /meta/samples readstotal H5I_DATASET INTEGER 441356
## 38 /meta/samples relation H5I_DATASET STRING 441356
## 39 /meta/samples series_id H5I_DATASET STRING 441356
## 40 /meta/samples singlecellprobability H5I_DATASET FLOAT 441356
## 41 /meta/samples source_name_ch1 H5I_DATASET STRING 441356
## 42 /meta/samples status H5I_DATASET STRING 441356
## 43 /meta/samples submission_date H5I_DATASET STRING 441356
## 44 /meta/samples taxid_ch1 H5I_DATASET STRING 441356
## 45 /meta/samples title H5I_DATASET STRING 441356
## 46 /meta/samples type H5I_DATASET STRING 441356
#Read in all 441,356 GEO Accession numbers (eg all total human samples on GEO)
all.samples.human <- h5read(archs4.human, name="meta/samples/geo_accession")
#Query the ArchS4 database for the samples I need from the Abud study, based on descriptions in the paper and on the GEO Entry
Abud_Samples <- c("GSM2360252", #10318X2
"GSM2360253", #7028X2
"GSM2360254", #x2-1
"GSM2360255", #x2-2
"GSM2360256", #x2-3
"GSM2360257", #x2-4
"GSM2360283", #HC-1
"GSM2360284", #HC-2
"GSM2360285", #HC-3
"GSM2360286", #HC-4
"GSM2360287", #HC-5
"GSM2360288", #HC-6
"GSM2445478", #n200 -2
"GSM2445479", #n200 -3
"GSM2445480" #n200 -4
)
#Find the indices of the above samples using their accession numbers
Abud_Sample_Locations <- which(all.samples.human %in% Abud_Samples)
genes <- h5read(archs4.human, "meta/genes/genes") #Bring in gene labels provided by ArchS4
#Read in raw counts for my samples of interest
Expression_Abud <- t(h5read(archs4.human, "data/expression",
index=list(Abud_Sample_Locations, 1:length(genes))))
H5close()
rownames(Expression_Abud) <- genes
colnames(Expression_Abud) <- all.samples.human[Abud_Sample_Locations]
#Below, I access some meta-data from the study to verify that the appropriate
#data was extracted from ArchS4, and clean up colnames to more informative labels
Sample_source_name_ch1 <- h5read(archs4.human, "meta/samples/source_name_ch1")
Sample_title <- h5read(archs4.human, name= "meta/samples/title")
Sample_characteristics <- h5read(archs4.human, name="meta/samples/characteristics_ch1")
studyDesign_Abud <- tibble(Sample_title_Abud = Sample_title[Abud_Sample_Locations],
Sample_source_name_ch1 = Sample_source_name_ch1[Abud_Sample_Locations],
Sample_characteristics = Sample_characteristics[Abud_Sample_Locations])
head(studyDesign_Abud) #Just for context## # A tibble: 6 × 3
## Sample_title_Abud Sample_source_name_ch1 Sample_characteristics
## <chr[1d]> <chr[1d]> <chr[1d]>
## 1 x2-4 iPS microglia "stimulus/treatment: None\ttime (hou…
## 2 HC-5 iPS microglia "stimulus/treatment: Rat hippocampal…
## 3 10318X2 iPS microglia "stimulus/treatment: None\ttime (hou…
## 4 7028X2 iPS microglia "stimulus/treatment: None\ttime (hou…
## 5 x2-1 iPS microglia "stimulus/treatment: None\ttime (hou…
## 6 n200 -4 human iPS microglia "stimulus/treatment: None\ttime (hou…
colnames(Expression_Abud) <- studyDesign_Abud$Sample_title_Abud
#Create DGE list and get cpm values for filtering / normalization
DGEList_Abud <- DGEList(Expression_Abud)
Abud_cpm <- cpm(DGEList_Abud)
table(rowSums(DGEList_Abud$counts==0)==15) ##
## FALSE TRUE
## 26664 8574
#Create and reshape (melt) tibble of log transformed raw counts to visualized unfiltered / unnormalized data
Abud_log2_cpm <- as_tibble(cpm(DGEList_Abud, log = TRUE) , rownames = "GeneID")
Abud_log2_cpm_melt <- melt(Abud_log2_cpm)
#visualize overall expression to show excessive influence of unexpressed / lowly expressed genes
ggplot(Abud_log2_cpm_melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
theme(axis.text.x=element_blank()) +
labs(y="log2 expression",
title="Log2 Counts per Million (CPM)",
subtitle="Abud, unfiltered, non-normalized",
caption=paste0("produced on ", Sys.time()))#create logical vector that evaluates to TRUE when at least 4 of 15 samples express a given gene
filtrate_abud <- rowSums(Abud_cpm > 1) >= 4
table(filtrate_abud)## filtrate_abud
## FALSE TRUE
## 22270 12968
#subset DGE list so that only genes expressed in at least 3 samples remain, remove 24,138 unexpressed genes
#Redo cpm / log transformation and reshape / revisualize data
DGEList_Abud_filtered <- DGEList_Abud[filtrate_abud, ]
Abud_log2_cpm_filtered <- as_tibble(cpm(DGEList_Abud_filtered, log = TRUE), rownames = "GeneID")
Abud_log2_cpm_filtered_melt <- melt(Abud_log2_cpm_filtered)
ggplot(Abud_log2_cpm_filtered_melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
theme(axis.text.x=element_blank()) +
labs(y="log2 expression",
title="Log2 Counts per Million (CPM)",
subtitle="Abud, filtered, non-normalized",
caption=paste0("produced on ", Sys.time()))#TPM normalization
DGEList_Abud_filtered_norm <- calcNormFactors(DGEList_Abud_filtered, method = "TMM")
DGEList_Abud_filtered_norm <- as_tibble(cpm(DGEList_Abud_filtered_norm, log = TRUE), rownames = "GeneID")
IPSC_Microglia_Abud <- column_to_rownames(DGEList_Abud_filtered_norm, var = "GeneID")
IPSC_Microglia_Abud <- IPSC_Microglia_Abud %>%
dplyr::select(order(colnames(IPSC_Microglia_Abud)))
#clean up non-intuitive sample labels
labels_Abud <- c(rep("IPSC_Microglia_Abud", 2), rep("IPSC_Microglia_Abud_rHcN", 6), rep("IPSC_Microglia_Abud", 7))
labels_Abud_Numbers <- c("1", "2", "1", "2", "3", "4", "5", "6", "3", "4", "5", "6", "7", "8", "9")
for(l in 1:length(labels_Abud)){
labels_Abud[l] <- paste0(labels_Abud[l], labels_Abud_Numbers[l])
}
colnames(IPSC_Microglia_Abud) <- labels_Abud
#reorder again for convenient grouping of now renamed samples,
#Add inproved labels to DGEList_Abud_filtered_norm
IPSC_Microglia_Abud <- dplyr::select(IPSC_Microglia_Abud, order(colnames(IPSC_Microglia_Abud)))
DGEList_Abud_filtered_norm <- rownames_to_column(IPSC_Microglia_Abud, var = "GeneID")#Blurton jones 2019 data (IPSC-Microglia +/- Transplant ("Xenograft") into chimeric mice) ----
#This processed dataset will contain 16 total samples (10 transplanted iMg,
#6 in vitro. Transplanted iMg are notated 'xMG')
#The number of observations (expressed genes) for this dataset is 13,421
#The processed data frames containing these data are stored in tbl's
#"IPSC_Microglia_Blurton" (no GeneID column, gene names are row names)
#or "DGEList_Blurton_filtered_norm" (maintains GeneID as a separate column)
Blurton_Samples <- c("GSM3908536", #MBJ_iMGL_SAL_1
"GSM3908537", #MBJ_iMGL_SAL_2
"GSM3908538", #MBJ_iMGL_SAL_3
"GSM3908539", #MBJ_iMGL_LPS_1
"GSM3908540", #MBJ_iMGL_LPS_2
"GSM3908541", #MBJ_iMGL_LPS_3
"GSM3908542", #MBJ_xMG_GFP_1
"GSM3908543", #MBJ_xMG_GFP_2
"GSM3908544", #MBJ_xMG_GFP_3
"GSM3908545", #MBJ_xMG_GFP_LPS_1
"GSM3908546", #MBJ_xMG_GFP_LPS_2
"GSM3908547", #MBJ_xMG_GFP_LPS_3
"GSM3908548", #MBJ_xMG_GFP_LPS_4
"GSM3908549", #MBJ_xMG_GFP_LPS_5
"GSM3908550", #MBJ_xMG_CDI_1
"GSM3908551", #MBJ_xMG_CDI_2
"GSM3908552", #MBJ_xMG_CDI_3
"GSM3908553", #MBJ_xMG_CDI_4
"GSM3908554", #MBJ_xMG_CDI_5
"GSM3908555", #MBJ_xMG_CDI_6
"GSM3908556", #MBJ_xMG_10C_1
"GSM3908557", #MBJ_xMG_10C_2
"GSM3908558", #MBJ_xMG_10C_3
"GSM3908559", #MBJ_xMG_10C_4
"GSM3908560", #MBJ_ExVivo_1
"GSM3908561", #MBJ_ExVivo_2
"GSM3908562", #MBJ_iMGL_GFP_1
"GSM3908563", #MBJ_iMGL_GFP_2
"GSM3908564", #MBJ_iMGL_GFP_3
"GSM3908565", #MBJ_iMGL_GFP_4
"GSM3908566", #MBJ_iMGL_GFP_5
"GSM3908567" #MBJ_iMGL_GFP_6
)
Blurton_Sample_Locations <- which(all.samples.human %in% Blurton_Samples)
Expression_Blurton <- t(h5read(archs4.human, "data/expression",
index=list(Blurton_Sample_Locations, 1:length(genes))))
H5close()
rownames(Expression_Blurton) <- genes
colnames(Expression_Blurton) <- all.samples.human[Blurton_Sample_Locations]
#Below, I access some meta-data from the study to verify that the appropriate data was
#extracted from ArchS4, and clean up colnames to more informative labels
studyDesign_Blurton <- tibble(Sample_title_Blurton = Sample_title[Blurton_Sample_Locations],
Sample_source_name_ch1_Blurton = Sample_source_name_ch1[Blurton_Sample_Locations],
Sample_characteristics_Blurton = Sample_characteristics[Blurton_Sample_Locations])
head(studyDesign_Blurton)## # A tibble: 6 × 3
## Sample_title_Blurton Sample_source_name_ch1_Blurton Sample_characteristics_B…¹
## <chr[1d]> <chr[1d]> <chr[1d]>
## 1 MBJ_iMGL_SAL_1 Microglia "cell line: AICS-0036 (Co…
## 2 MBJ_iMGL_SAL_2 Microglia "cell line: AICS-0036 (Co…
## 3 MBJ_iMGL_SAL_3 Microglia "cell line: AICS-0036 (Co…
## 4 MBJ_iMGL_LPS_1 Microglia "cell line: AICS-0036 (Co…
## 5 MBJ_iMGL_LPS_2 Microglia "cell line: AICS-0036 (Co…
## 6 MBJ_iMGL_LPS_3 Microglia "cell line: AICS-0036 (Co…
## # … with abbreviated variable name ¹Sample_characteristics_Blurton
colnames(Expression_Blurton) <- studyDesign_Blurton$Sample_title_Blurton
#Remove samples unnecessary for this analysis based on sample characteristics accessed above
Expression_Blurton <- Expression_Blurton[ , c(15:24, 27:32)]
#Create DGE list for filtering / normalizing Blurton Data
DGEList_Blurton <- DGEList(Expression_Blurton)
Expression_Blurton_cpm <- cpm(DGEList_Blurton)
table(rowSums(DGEList_Blurton$counts==0)==18) ##
## FALSE
## 35238
#Create and reshape (melt) tibble of log transformed raw counts to visualized unfiltered / unnormalized data
Blurton_log2_cpm <- as_tibble(cpm(DGEList_Blurton, log = TRUE) , rownames = "GeneID")
Blurton_log2_cpm_melt <- melt(Blurton_log2_cpm)
#visualize overall expression to show excessive influence of unexpressed / lowly expressed genes
ggplot(Blurton_log2_cpm_melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
theme(axis.text.x=element_blank()) +
labs(y="log2 expression",
title="Log2 Counts per Million (CPM)",
subtitle="Blurton-Jones, unfiltered, non-normalized",
caption=paste0("produced on ", Sys.time()))#create logical vector that evaluates to TRUE when at least 6 of 16 samples express a given gene
filtrate_Blurton<- rowSums(Expression_Blurton_cpm > 1) >= 6
table(filtrate_Blurton)## filtrate_Blurton
## FALSE TRUE
## 21817 13421
#subset DGE list so that only genes expressed in at least 6 samples remain, remove 21817 unexpressed genes
#Redo counts per million, log transformation, and reshaping to visualize the filtered dataset
DGEList_Blurton_filtered <- DGEList_Blurton[filtrate_Blurton, ]
Blurton_log2_cpm_filtered <- as_tibble(cpm(DGEList_Blurton_filtered, log = TRUE), rownames = "GeneID")
Blurton_log2_cpm_filtered_melt <- melt(Blurton_log2_cpm_filtered)
ggplot(Blurton_log2_cpm_filtered_melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
theme(axis.text.x=element_blank()) +
labs(y="log2 expression",
title="Log2 Counts per Million (CPM)",
subtitle="Blurton-Jones, filtered, non-normalized",
caption=paste0("produced on ", Sys.time()))#TPM normalization
DGEList_Blurton_filtered_norm <- calcNormFactors(DGEList_Blurton_filtered, method = "TMM")
DGEList_Blurton_filtered_norm <- as_tibble(cpm(DGEList_Blurton_filtered_norm, log = TRUE), rownames = "GeneID")
IPSC_Microglia_Blurton <- column_to_rownames(DGEList_Blurton_filtered_norm, var = "GeneID") #Svoboda 2019 data, (IPSC-Microglia +/- Transplant ("Xenograft") into chimeric mice) ----
#This processed dataset will contain 18 total samples of IPSC microglia
#(6 in transplant for 60 days, 6 in transplant for 10 days, and 6 differentiated in vitro (10 days))
#The number of observations (expressed genes) for this dataset is 13,433
#The processed data frames containing these data are stored in tbl's
#"IPSC_Microglia_Svoboda" (no GeneID column, gene names are row names)
#or "DGEList_Svoboda_filtered_norm" (maintains GeneID as a separate column)
Svoboda_Samples <- c("GSM4133389", #iMP1
"GSM4133390", #iMP2
"GSM4133391", #iMP3
"GSM4133392", #iMP4
"GSM4133393", #iMP5
"GSM4133394", #iMP6
"GSM4133395", #In vivo iMG 10dpi rep 1
"GSM4133396", #In vivo iMG 10dpi rep 2
"GSM4133397", #In vivo iMG 10dpi rep 3
"GSM4133398", #In vivo iMG 10dpi rep 4
"GSM4133399", #In vivo iMG 10dpi rep 5
"GSM4133400", #In vivo iMG 10dpi rep 6
"GSM4133401", #In vivo iMG 60dpi rep 1
"GSM4133402", #In vivo iMG 60dpi rep 2
"GSM4133403", #In vivo iMG 60dpi rep 3
"GSM4133404", #In vivo iMG 60dpi rep 4
"GSM4133405", #In vivo iMG 60dpi rep 5
"GSM4133406", #In vivo iMG 60dpi rep 6
"GSM4133407", #In vitro iMG 10dpi rep 1
"GSM4133408", #In vitro iMG 10dpi rep 2
"GSM4133409", #In vitro iMG 10dpi rep 3
"GSM4133410", #In vitro iMG 10dpi rep 4
"GSM4133411", #In vitro iMG 10dpi rep 5
"GSM4133412", #In vitro iMG 10dpi rep 6
"GSM4133413", #In vitro iMG 60dpi rep 1
"GSM4133414", #In vitro iMG 60dpi rep 2
"GSM4133415", #In vitro iMG 60dpi rep 3
"GSM4133416", #In vitro iMG 60dpi rep 4
"GSM4133417", #In vitro iMG 60dpi rep 5
"GSM4133418", #In vitro iMG 60dpi rep 6
"GSM4133419", #In Vivo 2 [M543]
"GSM4133420", #In Vivo 1 [M544]
"GSM4133421", #In Vivo 4 [M1]
"GSM4133422", #In Vivo 3 [M2]
"GSM4133423", #In Vitro 1 [IV1]
"GSM4133424" #In Vitro 2 [IV2]
)
Svoboda_Sample_Locations <- which(all.samples.human %in% Svoboda_Samples)
Expression_Svoboda <- t(h5read(archs4.human, "data/expression",
index=list(Svoboda_Sample_Locations, 1:length(genes))))
H5close()
rownames(Expression_Svoboda) <- genes
colnames(Expression_Svoboda) <- all.samples.human[Svoboda_Sample_Locations]
#Use Meta-data to create study design file and facilitate cleaning
studyDesign_Svoboda <- tibble(Sample_title_Svoboda = Sample_title[Svoboda_Sample_Locations],
Sample_source_name_ch1_Svoboda = Sample_source_name_ch1[Svoboda_Sample_Locations],
Sample_characteristics_Svoboda = Sample_characteristics[Svoboda_Sample_Locations])
colnames(Expression_Svoboda) <- studyDesign_Svoboda$Sample_title_Svoboda
Expression_Svoboda <- Expression_Svoboda[ , 9:26] #remove irrelevant samples
##Create DGE list for filtering and normalization
#perform cpm() and log transformation to visualize
#overall distribution of expression data
DGEList_Svoboda <- DGEList(Expression_Svoboda)
Expression_Svoboda_cpm <- cpm(DGEList_Svoboda)
table(rowSums(DGEList_Svoboda$counts==0)==18) ##
## FALSE TRUE
## 27100 8138
#Create and reshape (melt) tibble of log transformed raw counts to visualized unfiltered / unnormalized data
Svoboda_log2_cpm <- as_tibble(cpm(DGEList_Svoboda, log = TRUE) , rownames = "GeneID")
Svoboda_log2_cpm_melt <- melt(Svoboda_log2_cpm)
#visualize overall expression to show excessive influence of unexpressed / lowly expressed genes
ggplot(Svoboda_log2_cpm_melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
theme(axis.text.x=element_blank()) +
labs(y="log2 expression",
title="Log2 Counts per Million (CPM)",
subtitle="Svoboda, unfiltered, non-normalized",
caption=paste0("produced on ", Sys.time()))#create logical vector that evaluates to TRUE when at least 6 of 18 samples express a given gene
filtrate_Svoboda<- rowSums(Expression_Svoboda_cpm > 1) >= 6
table(filtrate_Svoboda)## filtrate_Svoboda
## FALSE TRUE
## 21805 13433
#subset DGE list so that only genes expressed in at least 6 samples remain.
#This will 21805 unexpressed genes, I'll then violin plot again to visualize impact of filtering
DGEList_Svoboda_filtered <- DGEList_Svoboda[filtrate_Svoboda, ]
Svoboda_log2_cpm_filtered <- as_tibble(cpm(DGEList_Svoboda_filtered, log = TRUE), rownames = "GeneID")
Svoboda_log2_cpm_filtered_melt <- melt(Svoboda_log2_cpm_filtered)
ggplot(Svoboda_log2_cpm_filtered_melt, aes(x=variable, y=value, fill=variable)) +
geom_violin(trim = FALSE, show.legend = FALSE) +
stat_summary(fun.y = "median",
geom = "point",
shape = 124,
size = 6,
color = "black",
show.legend = FALSE) +
theme(axis.text.x=element_blank()) +
labs(y="log2 expression",
title="Log2 Counts per Million (CPM)",
subtitle="Svoboda, filtered, non-normalized",
caption=paste0("produced on ", Sys.time()))#TPM normalization
DGEList_Svoboda_filtered_norm <- calcNormFactors(DGEList_Svoboda_filtered, method = "TMM")
DGEList_Svoboda_filtered_norm <- as_tibble(cpm(DGEList_Svoboda_filtered_norm, log = TRUE), rownames = "GeneID")
IPSC_Microglia_Svoboda <- column_to_rownames(DGEList_Svoboda_filtered_norm, var = "GeneID")
#Add "Svoboda_" to sample lables to facilitate later analysis and visualization.
labels_Svoboda <- c(rep(NA, ncol(IPSC_Microglia_Svoboda)))
for(m in 1:length(labels_Svoboda)){
labels_Svoboda[m] <- paste0("Svoboda_", colnames(IPSC_Microglia_Svoboda[m]))
}
colnames(IPSC_Microglia_Svoboda) <- labels_Svoboda
colnames(DGEList_Svoboda_filtered_norm[2:ncol(DGEList_Svoboda_filtered_norm)]) <- labels_Svoboda#Data Aggregation ----
#Here, I will add all 6 of the pre-processed datasets into a list, then use purrr::reduce()
#to apply left_join() to all the elements of this list This will create a tibble of all 118 samples
#with a total of 14,528 observations (expressed genes). The number of observations comes from the Gosselin
#reference dataset, which had the largest number of samples (14,528).
#However, this introduces NA data into the aggregate dataset, since other component
#tibbles may have been missing some of those genes. Removing any genes for which any
#samples have NA vaues leads to a removal of 4269 genes, leading
#to a final number of observations of 10,259 in the aggregate dataset.
#For subsequent exploratory and DGE analysis, the generated tbl.df Aggregation_Set_Math will be used
#This tibble will also be written out to a .tsv for sharing offline/etc
Aggregation_List <- list(Gosselin_log2_cpm_filtered_norm, Eggen_log2_cpm_filtered_norm, KJS_iMG_MDM,
DGEList_Abud_filtered_norm, DGEList_Blurton_filtered_norm, DGEList_Svoboda_filtered_norm)
Aggregation_Set <- Aggregation_List %>%
purrr::reduce(left_join, by = "GeneID")
Aggregation_Set <- na.omit(Aggregation_Set) #remove NA probes
write_tsv(Aggregation_Set, "BMIN_503_Aggregated_Transcriptomes.tsv") #write to .tsv
#Move the gene names to rownames to eliminate the non numeric column, needed for PCA /
#Other statistical operations
Aggregation_Set_Math <- column_to_rownames(Aggregation_Set, var = "GeneID")
dim(Aggregation_Set_Math)## [1] 10259 118
#Create Study Design File for Aggregate set to facilitate DGE Analysis
#As well as further sharing of this report / dataset with colleagues.
#Will contain sample labels and factors corresponding to Study authors
#For each dataset, condition (eg ExVivo, InVitro, Transplanted, etc),
#And cell type (microglia or macrophage, latter included as a sort of negative control)
#Other categorical variables (EG HIV infectable? yes or no, etc) could be added to this study design file
#to check other relationships between the datasets. Expand on this concept later
labels_Aggregate <- colnames(Aggregation_Set_Math)
studies_Aggregate <- c(rep("Gosselin_2017", 24), rep("Eggen_2017", 39), rep("Ryan_2020", 6),
rep("Abud_2017", 15), rep("Blurton-Jones_2019", 16), rep("Svoboda", 18))
condition_Aggregate <- c(rep("ExVivo_Juvenile", 24), rep("ExVivo_Adult", 39), rep("InVitro_Monoculture", 6),
rep("InVitro_rHcN_Coculture", 6), rep("InVitro_Monoculture", 9),
rep("Xenografted", 10), rep("InVitro_Monoculture", 6), rep("Xenografted", 6),
rep("Xenografted_60d", 6), rep("InVitro_Monoculture", 6))
celltype_Aggregate <- c(rep("Microglia", 63), rep("Monocyte_Derived_Macrophages", 3),
rep("Microglia", 52))
Study_Design_Aggregate <- tibble(Sample_Labels = labels_Aggregate, Study = studies_Aggregate,
Conditions = condition_Aggregate, Cell_Types = celltype_Aggregate)
Study_Design_Aggregate <- Study_Design_Aggregate %>%
dplyr::mutate(Study = factor(Study, levels = c("Gosselin_2017", "Eggen_2017", "Ryan_2020",
"Abud_2017", "Blurton-Jones_2019", "Svoboda"))) %>%
dplyr::mutate(Conditions = factor(Conditions, levels = c("ExVivo_Juvenile", "ExVivo_Adult", "InVitro_Monoculture",
"InVitro_rHcN_Coculture", "Xenografted", "Xenografted_60d"))) %>%
dplyr::mutate(Cell_Types = factor(Cell_Types, levels = c("Microglia", "Monocyte_Derived_Macrophages")))
head(Study_Design_Aggregate)## # A tibble: 6 × 4
## Sample_Labels Study Conditions Cell_Types
## <chr> <fct> <fct> <fct>
## 1 J_Microglia_ExVivo1 Gosselin_2017 ExVivo_Juvenile Microglia
## 2 J_Microglia_ExVivo2 Gosselin_2017 ExVivo_Juvenile Microglia
## 3 J_Microglia_ExVivo3 Gosselin_2017 ExVivo_Juvenile Microglia
## 4 J_Microglia_ExVivo4 Gosselin_2017 ExVivo_Juvenile Microglia
## 5 J_Microglia_ExVivo5 Gosselin_2017 ExVivo_Juvenile Microglia
## 6 J_Microglia_ExVivo6 Gosselin_2017 ExVivo_Juvenile Microglia
Describe your results and include relevant tables, plots, and code/comments used to obtain them. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.
#Exploratory Analysis (PCA)----
#Perform Principal Component Analysis (PCA) on the entire aggregated dataset.
#This is a common unsupervised ML approach for exploratory analysis
#of transcriptomic data. It allows us to visualize the degree of similarity / difference
#between various samples through dimensionality reduction.
pca_res_aggregate <- prcomp(t(Aggregation_Set_Math), scale. = F, retx = T)
pc_var_aggregate <- pca_res_aggregate$sdev^2 #Vector of Eigenvalues
pc_per_aggregate <- round(pc_var_aggregate/sum(pc_var_aggregate)*100, 1) #Percentages of variance per principal component, for plot labels
pca_res_df_aggregate <- as_tibble(pca_res_aggregate$x) #Loadings, influence of each PCA on each sample, main data for plots
#Plot Principal Components 1 and 2
pca_plot_aggregate_1_2 <- ggplot(pca_res_df_aggregate, aes(x = PC2, y = PC1,
color = condition_Aggregate, shape = studies_Aggregate)) +
geom_point(size = 3, alpha = .5) +
xlab(paste0("PC2 (", pc_per_aggregate[2], "%", ")")) +
ylab(paste0("PC1 (", pc_per_aggregate[1], "%", ")")) +
labs(title = "PCA Plot") +
stat_ellipse(aes(color = condition_Aggregate), #Add Elipse to help differentiate groups
linewidth = 0.9, level = 0.95)
ggplotly(pca_plot_aggregate_1_2)#Plot Principal Components 1 and 3
pca_plot_aggregate_1_3 <- ggplot(pca_res_df_aggregate, aes(x = PC3, y = PC1, color = condition_Aggregate, shape = studies_Aggregate)) +
geom_point(size = 3, alpha = .5) +
xlab(paste0("PC3 (", pc_per_aggregate[3], "%", ")")) +
ylab(paste0("PC1 (", pc_per_aggregate[1], "%", ")")) +
labs(title = "PCA Plot") +
stat_ellipse(aes(color = condition_Aggregate), #Add Elipse to help differentiate groups
linewidth = 0.9, level = 0.95) +
theme_bw()
ggplotly(pca_plot_aggregate_1_3)#Plot Principal Components 1 and 3 again, this time using celltype_Aggregate to distinguish Microglia and MDM in Ryan 2020 dataset
#This is a quick and dirty (and less easily interprable) work around
pca_plot_aggregate_1_3_fill_MDM <- ggplot(pca_res_df_aggregate, aes(x = PC3, y = PC1, color = condition_Aggregate, shape = studies_Aggregate,
fill = celltype_Aggregate)) +
geom_point(size = 3, alpha = .5) +
xlab(paste0("PC3 (", pc_per_aggregate[3], "%", ")")) +
ylab(paste0("PC1 (", pc_per_aggregate[1], "%", ")")) +
labs(title = "PCA Plot") +
stat_ellipse(aes(color = condition_Aggregate), #Add Elipse to help differentiate groups
linewidth = 0.9, level = 0.95) +
theme_bw()
ggplotly(pca_plot_aggregate_1_3_fill_MDM)#Plot Principal Components 2 and 3
pca_plot_aggregate_2_3 <- ggplot(pca_res_df_aggregate, aes(x = PC3, y = PC2, color = condition_Aggregate, shape = studies_Aggregate)) +
geom_point(size = 3, alpha = .5) +
xlab(paste0("PC3 (", pc_per_aggregate[3], "%", ")")) +
ylab(paste0("PC2 (", pc_per_aggregate[2], "%", ")")) +
labs(title = "PCA Plot") +
stat_ellipse(aes(color = condition_Aggregate), #Add Elipse to help differentiate groups
linewidth = 0.9, level = 0.95) +
theme_bw()
ggplotly(pca_plot_aggregate_2_3)#Use PCA coordinates ()
#PCA Euclidean distances to the primary microglia samples were calculated
#between all pairs of points in each group, using PC1 and PC2 on the top 500 genes.
#Take a stab at this by 1: making thr rownames of pca_res_df_aggregate be our 118 sample names
#2: Use only PC1 and PC2, just subset pca_res_df_aggregate
#3: Then use the distance matrix command below? unsure
distance_matrix <- dist(t(pca_res_df_aggregate), diag = TRUE)#Differential Gene Expression Analysis----
#Pairwise comparisons included in this section as of 12/2022:
#Adult ExVivo Microglia (Eggen 2017) vs Juvenile ExVivo Microglia (Gosselin 2017)
#Find differences between juvenile and adult microglia
#InVitro IPSC-Microglia (Blurton-Jones 2017) vs InVitro IPSC-Microglia (Ryan 2020)
#IPSC-Microglia from the Blurton-Jones study cluster closer to the reference datasets,
#DGE may be useful to show why
#Xenografted IPSC-Microglia (Blurton-Jones 2017) vs Xenografted IPSC-Microglia (Svoboda 2017)
#It is difficult to tell from the PCA but it looks like the Blurton-Jones xenografted cells slightly
#outperform those from the Svoboda study. DGE should show us some of what makes the former microglia more ”life-like”
#Xenografted IPSC-Microglia (Svoboda 2017) vs InVitro IPSC-Microglia (Ryan 2020)
#Surprisingly, xenografted Microglia from the Svoboda study and InVitro IPSC-Microglia from the Ryan
#study were of comparable distance from the reference data sets in the PCA plots, therefore worth further investigation
#Create a matrix of sample information and factor wrap to facilitate pairwise comparisons
matrix_DGE <- c(rep("J_Microglia_ExVivo", 24), rep("A_Microglia_ExVivo", 39), rep("Ryan_MDM_InVitro", 3),
rep("Ryan_iMG_InVitro", 3), rep("Abud_iMG_InVitro_rHcH1", 6), rep("Abud_iMG_InVitro_", 9),
rep("BlurtonJones_iMg_Xenotransplant", 10), rep("BlurtonJones_iMG_InVitro", 6),
rep("Svoboda_iMg_Xenotransplant", 6), rep("Svoboda_iMg_Xenotransplant_60d", 6), rep("Svoboda_iMg_InVitro", 6))
matrix_DGE <- factor(matrix_DGE, levels = c("J_Microglia_ExVivo", "A_Microglia_ExVivo", "Ryan_MDM_InVitro",
"Ryan_iMG_InVitro", "Abud_iMG_InVitro_rHcH1", "Abud_iMG_InVitro_",
"BlurtonJones_iMg_Xenotransplant", "BlurtonJones_iMG_InVitro",
"Svoboda_iMg_Xenotransplant", "Svoboda_iMg_Xenotransplant_60d", "Svoboda_iMg_InVitro"))
summary(matrix_DGE)## J_Microglia_ExVivo A_Microglia_ExVivo
## 24 39
## Ryan_MDM_InVitro Ryan_iMG_InVitro
## 3 3
## Abud_iMG_InVitro_rHcH1 Abud_iMG_InVitro_
## 6 9
## BlurtonJones_iMg_Xenotransplant BlurtonJones_iMG_InVitro
## 10 6
## Svoboda_iMg_Xenotransplant Svoboda_iMg_Xenotransplant_60d
## 6 6
## Svoboda_iMg_InVitro
## 6
design_matrix <- model.matrix(~0 + matrix_DGE)
colnames(design_matrix) <- levels(matrix_DGE)
#Use limma::voom() to model the mean:variance relationship in our dataset. Necessary
#for providing weights in downstream linear modeling (lmfit())
#Requires non-negative values, so will need to un log2 transform our aggregate dataset
#Aggregation_Set_Math_no_log still has filtered / normalized data but now the
#Units are just transcripts per million (TPM) instead of log2(TPM) which is
#typically reported
Aggregation_Set_Math_no_log <- 2^Aggregation_Set_Math
V_Aggregation_Set_Math <- voom(Aggregation_Set_Math_no_log, design_matrix, plot = TRUE)#Fir a linear model to the now weighted dataset
fit <- lmFit(V_Aggregation_Set_Math, design_matrix)
#Extract the expression data back out from the EList created by voom
#and redo log transformation. Contains some NA, may have to remove later if necessary
V_Aggregation_Set_Math_tbl <- (as_tibble(log2(V_Aggregation_Set_Math$E), rownames = "GeneID"))## Warning in as_tibble(log2(V_Aggregation_Set_Math$E), rownames = "GeneID"): NaNs
## produced
#Use our design matrix file to begin creating contrasts
#corresponding to our pairwise comparisons of interest
#Extracts the linear model created above, get's statistics on it
#using eBayes(), and then summarizes top differentially expressed genes / DEGs in a topTable
#Beginning with a comparison of the two reference data sets (Eggen and Gosselin)
contrast.matrix_Juv_Adlt <- makeContrasts(Juv_v_Adlt = A_Microglia_ExVivo - J_Microglia_ExVivo, levels = design_matrix) #looper here
fits_Juv_Adlt <- contrasts.fit(fit, contrast.matrix_Juv_Adlt)
ebFit_Juv_Adlt <- eBayes(fits_Juv_Adlt)
top_hits_Juv_Adlt <- topTable(ebFit_Juv_Adlt, adjust = "BH", coef = 1, number = 1500, sort.by = "logFC")
top_hits_Juv_Adlt <- as_tibble(top_hits_Juv_Adlt, rownames = "GeneID")
gene_names_top_1 <- top_hits_Juv_Adlt$GeneID
#Now Use top table to create an enhanced volcano plot and visualize differentially expressed genes
EnhancedVolcano(top_hits_Juv_Adlt, lab = gene_names_top_1, x = 'logFC',
y = 'adj.P.Val', title = "Adult vs Pediatric ExVivo MG",
pCutoff = 10e-6,
FCcutoff = 2.0,
pointSize = 3.0,
labSize = 6.0,
shape = c(1, 4, 23, 25),
colAlpha = 1)## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## ℹ The deprecated feature was likely used in the EnhancedVolcano package.
## Please report the issue to the authors.
#now determine numbers of significant differentially expressed genes, summarize results
#And store all differentially expressed genes for this comparison (in either direction)
#In a data frame, and write those results to a tsv for future sharing or GSEA
results_Juv_Adlt <- decideTests(ebFit_Juv_Adlt, method="global", adjust.method="BH", p.value = 0.01, lfc = 2)
head(results_Juv_Adlt)## TestResults matrix
## Contrasts
## Juv_v_Adlt
## A2M 0
## AAAS 0
## AACS 0
## AAGAB 0
## AAK1 0
## AAMDC 0
## Juv_v_Adlt
## Down 68
## NotSig 9937
## Up 254
ddx_genes_Eggen_Adult<- V_Aggregation_Set_Math$E[results_Juv_Adlt[ , 1] != 0, ]
ddx_genes_Eggen_Adult <- as_tibble(ddx_genes_Eggen_Adult, rownames = "GeneID")
write_tsv(ddx_genes_Eggen_Adult, "ddx_Genes_Eggen_v_Gosselin.tsv") #write to .tsv
#Moving on to comparing Abud vs Ryan InVitro IPSC-Microglia
contrast.matrix_Abud_Ryan <- makeContrasts(Abud_v_Ryan = Ryan_iMG_InVitro - Abud_iMG_InVitro_, levels = design_matrix)
fits_Abud_Ryan <- contrasts.fit(fit, contrast.matrix_Abud_Ryan)
ebFit_Abud_Ryan <- eBayes(fits_Abud_Ryan)
top_hits_Abud_Ryan <- topTable(ebFit_Abud_Ryan, adjust = "BH", coef = 1, number = 1500, sort.by = "logFC")
top_hits_Abud_Ryan <- as_tibble(top_hits_Abud_Ryan, rownames = "GeneID")
gene_names_top_2 <- top_hits_Abud_Ryan$GeneID
#Now Use top table to create an enhanced volcano plot and visualize differentially expressed genes
EnhancedVolcano(top_hits_Abud_Ryan, lab = gene_names_top_2, x = 'logFC',
y = 'adj.P.Val', title = "Abud vs Ryan InVitro MG",
pCutoff = 10e-6,
FCcutoff = 2.0,
pointSize = 3.0,
labSize = 6.0,
shape = c(1, 4, 23, 25),
colAlpha = 1)#now determine numbers of significant differentially expressed genes, summarize results
#And store all differentially expressed genes for this comparison (in either direction)
#In a data frame, and write those results to a tsv for future sharing or GSEA
results_Abud_Ryan <- decideTests(ebFit_Abud_Ryan, method="global", adjust.method="BH", p.value = 0.01, lfc = 2)
head(results_Abud_Ryan)## TestResults matrix
## Contrasts
## Abud_v_Ryan
## A2M 0
## AAAS 0
## AACS 0
## AAGAB 0
## AAK1 0
## AAMDC 0
## Abud_v_Ryan
## Down 195
## NotSig 9931
## Up 133
ddx_genes_Abud_Ryan <- V_Aggregation_Set_Math$E[results_Abud_Ryan[ , 1] != 0, ]
ddx_genes_Abud_Ryan <- as_tibble(ddx_genes_Abud_Ryan, rownames = "GeneID")
write_tsv(ddx_genes_Abud_Ryan, "ddx_Genes_Abud_v_Ryan.tsv") #write to .tsv
#Moving on to comparing Xenotransplanted Microglia from Blurton-Jones to the 60D
#Xenotransplants from the Svoboda study
contrast.matrix_BJ_Svoboda <- makeContrasts(BJ_v_Svoboda = BlurtonJones_iMg_Xenotransplant - Svoboda_iMg_Xenotransplant_60d, levels = design_matrix)
fits_BJ_Svoboda <- contrasts.fit(fit, contrast.matrix_BJ_Svoboda)
ebFit_BJ_Svoboda <- eBayes(fits_BJ_Svoboda)
top_hits_BJ_Svoboda <- topTable(ebFit_BJ_Svoboda, adjust = "BH", coef = 1, number = 1500, sort.by = "logFC")
top_hits_BJ_Svoboda <- as_tibble(top_hits_BJ_Svoboda, rownames = "GeneID")
gene_names_top_3 <- top_hits_BJ_Svoboda$GeneID
#Now Use top table to create an enhanced volcano plot and visualize differentially expressed genes
EnhancedVolcano(top_hits_BJ_Svoboda, lab = gene_names_top_3, x = 'logFC',
y = 'adj.P.Val', title = "Blurton-Jones vs Svoboda Xenotransplanted MG",
pCutoff = 10e-6,
FCcutoff = 2.0,
pointSize = 3.0,
labSize = 4.0,
shape = c(1, 4, 23, 25),
colAlpha = 1)#now determine numbers of significant differentially expressed genes, summarize results
#And store all differentially expressed genes for this comparison (in either direction)
#In a data frame, and write those results to a tsv for future sharing or GSEA
results_BJ_Svoboda <- decideTests(ebFit_BJ_Svoboda, method="global", adjust.method="BH", p.value = 0.01, lfc = 2)
head(results_BJ_Svoboda)## TestResults matrix
## Contrasts
## BJ_v_Svoboda
## A2M 0
## AAAS 0
## AACS 0
## AAGAB 0
## AAK1 0
## AAMDC 0
## BJ_v_Svoboda
## Down 32
## NotSig 9850
## Up 377
ddx_genes_BJ_Svoboda <- V_Aggregation_Set_Math$E[results_BJ_Svoboda[ , 1] != 0, ]
ddx_genes_BJ_Svoboda <- as_tibble(ddx_genes_BJ_Svoboda, rownames = "GeneID")
write_tsv(ddx_genes_BJ_Svoboda, "ddx_Genes_BJ_v_Svoboda.tsv") #write to .tsv
#Moving on to comparing Xenotransplanted Microglia from Svoboda study to
#InVitro IPSC-Microglia from the Ryan Study
contrast.matrix_Svoboda_Ryan <- makeContrasts(Svoboda_v_Ryan = Ryan_iMG_InVitro - Svoboda_iMg_Xenotransplant_60d, levels = design_matrix)
fits_Svoboda_Ryan <- contrasts.fit(fit, contrast.matrix_Svoboda_Ryan)
ebFit_Svoboda_Ryan <- eBayes(fits_Svoboda_Ryan)
top_hits_Svoboda_Ryan <- topTable(ebFit_Svoboda_Ryan, adjust = "BH", coef = 1, number = 2500, sort.by = "logFC")
top_hits_Svoboda_Ryan <- as_tibble(top_hits_Svoboda_Ryan, rownames = "GeneID")
gene_names_top_4 <- top_hits_Svoboda_Ryan$GeneID
#Now Use top table to create an enhanced volcano plot and visualize differentially expressed genes
EnhancedVolcano(top_hits_Svoboda_Ryan, lab = gene_names_top_4, x = 'logFC',
y = 'adj.P.Val', title = "Svoboda Xenotransplanted MG vs Ryan InVitro IPSC-MG",
pCutoff = 10e-6,
FCcutoff = 2.0,
pointSize = 1.5,
labSize = 4.0,
shape = c(1, 4, 23, 25),
colAlpha = 0.5)#now determine numbers of significant differentially expressed genes, summarize results
#And store all differentially expressed genes for this comparison (in either direction)
#In a data frame, and write those results to a tsv for future sharing or GSEA
results_Svoboda_Ryan <- decideTests(ebFit_Svoboda_Ryan, method="global", adjust.method="BH", p.value = 0.01, lfc = 2)
head(results_Svoboda_Ryan)## TestResults matrix
## Contrasts
## Svoboda_v_Ryan
## A2M -1
## AAAS 0
## AACS 0
## AAGAB 0
## AAK1 0
## AAMDC 0
## Svoboda_v_Ryan
## Down 349
## NotSig 8803
## Up 1107